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'^^ , We introduce a quantile-adaptive framework for nonlinear vari- 

(-y-j ' able screening with high-dimensional heterogeneous data. This frame- 

work has two distinctive features: (1) it allows the set of active vari- 
ables to vary across quantiles, thus making it more flexible to accom- 
pH ' modate heterogeneity; (2) it is model-free and avoids the difficult 

^^ ^ task of specifying the form of a statistical model in a high dimen- 

sional space. Our nonlinear independence screening procedure em- 
ploys spline approximations to model the marginal effects at a quan- 
2 . tile level of interest. Under appropriate conditions on the quantile 

functions without requiring the existence of any moments, the new 
procedure is shown to enjoy the sure screening property in ultra-high 
dimensions. Furthermore, the quantile-adaptive framework can natu- 
rally handle censored data arising in survival analysis. We prove that 
the sure screening property remains valid when the response variable 
is subject to random right censoring. Numerical studies confirm the 
fine performance of the proposed method for various semiparametric 
^s^ [ models and its effectiveness to extract quantile-specific information 

' . from heteroscedastic data. 

o 

^f) • 1. Introduction. We consider the problem of analyzing ultra-high di- 

mensional data, where the number of candidate covariates (or features) may 
increase at an exponential rate. Many efforts have been devoted to this 
challenging problem in recent years, motivated by modern applications in 
genomics, bioinformatics, chemometrics, among others. A practically ap- 



C^_' pealing approach is to first use a fast screening procedure to reduce the 



Received February 2012; revised January 2013. 

^Supported in part by NSF Grant DMS-10-07396, NIH Grant R01GM080503 and Na- 
tional Natural Science Foundation of China Grant 11129101. 

^Supported in part by NSF Grant DMS-10-07603. 

^Supported in part by a Baruch College Eugene M. Lang Fellowship. 

AMS 2000 subject classifications. Primary 68Q32, 62G99; secondary 62E99, 62N99. 

Key words and phrases. Feature screening, high dimension, polynomial splines, quan- 
tile regression, randomly censored data, sure independence screening. 



This is an electronic reprint of the original article published by the 

Institute of Mathematical Statistics in The Annals of Statistics, 

2013, Vol. 41, No. 1, 342-369. This reprint differs from the original in pagination 

and typographic detail. 

i 



2 X. HE, L. WANG AND H. G. HONG 

dimensionality of the feature space to a moderate scale; and then apply 
more sophisticated variable selection techniques in the second stage. In this 
paper, we propose a new quantile-adaptive, model-free variable screening 
procedure, which is particularly appealing for analyzing high dimensional 
heterogeneous data and data with censored responses. 

Fan and Lv (2008) proposed the sure independence screening (SIS) method- 
ology for linear regression which screens variables by ranking their marginal 
correlations with the response variable. They established the desirable sure 
screening property, that is, some important features are retained with prob- 
ability approaching one, even if the dimensionality of the features is allowed 
to grow exponentially fast with the sample size. Fan and Song (2010) fur- 
ther extended the methodology to generalized linear models, see also Fan, 
Samworth and Wu (2009). The problem of nonlinear features screening was 
addressed in Hall and Miller (2009) using generalized correlation ranking 
and more systematically in Fan, Feng and Song (2011) using nonparametric 
marginal ranking, which extended the scope of applications of sure indepen- 
dence screening. Biihlmann, Kalisch, and Maathuis (2010) introduced the 
new concept of partial faithfulness and proposed a computationally efhcient 
PC-algorithm for feature screening in linear models. 

Zhu et al. (2011) proposed a novel feature screening procedure which 
avoids the specification of a particular model structure. This model-free 
screening framework is very appealing because a misspecified model could 
easily corrupt the performance of a variable selection method. Partly moti- 
vated by this interesting piece of work, we propose a new framework called 
quantile-adaptive model-free screening. We advocate a quantile-adaptive ap- 
proach which allows the set of active variables to be different when mod- 
eling different conditional quantiles. This new framework provides a more 
complete picture of the conditional distribution of the response given all 
candidate covariates and is more natural and effective for analyzing high- 
dimensional data that are characterized by heteroscedasticity. 

In the quantile-adaptive model-free screening framework, we estimate 
marginal quantile regression nonparametrically using S-spline approxima- 
tion. In this aspect, our technique shares some similarity with that in Fan, 
Feng and Song (2011). The main technical challenge is to deal with the nons- 
mooth loss function, because the nonparametric marginal utility we consider 
does not have a closed form expression as in Fan, Feng and Song (2011). We 
derive useful exponential bounds using the empirical process theory to es- 
tablish the sure screening property. When working with marginal quantile 
regression, the usual sub-Gaussian tail type condition in high-dimensional 
analysis can be relaxed and replaced by the assumption that the conditional 
density function of the random error has a positive lower bound around 
the quantile of interest. Empirically, we also demonstrate that the proposed 
procedure works well with heavy-tailed error distributions. 
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Sure independence screening remains challenging and little explored when 
the response variable is subject to random censoring, a common problem in 
survival analysis. Fan, Feng and Wu (2010) extended the methodology of 
sure independence screening using the marginal Cox proportional hazards 
model and studied its performance empirically. In this paper, we demon- 
strate that in the quantile-adaptive model-free screening framework, ran- 
domly censored responses can be naturally accommodated by ranking a 
marginal weighted quantile regression utility. We establish the sure screen- 
ing property for the censored case under some general conditions. 

The rest of the paper is organized as follows. In Section 2, we introduce 
the quantile-adaptive model-free feature screening procedure. In Section 3, 
we investigate its theoretical properties. Section 4 discusses the extension to 
survival analysis. In Section 5, we carry out simulation studies to access the 
performance of the proposed method. The numerical results demonstrate 
the favorable performance of the proposed method, especially when the er- 
rors are heavy-tailed or heteroscedastic. In Section 6, we demonstrate the 
application on a real data example. Section 7 contains further discussions. 
The technical details are given in Section 8. 

2. Quantile-adaptive model-free feature screening. 

2.1. A general framework. We consider the problem of nonlinear variable 
screening in high-dimensional feature space, where we observe a response 
variable Y and associated covariates Xi,. . . ,Xp. The goal is to rapidly 
reduce the dimension of the covariate space p to a moderate scale via a 
computationally convenient procedure. Since ultra-high dimensional data 
often display heterogeneity, we advocate a quantile-adaptive feature screen- 
ing framework. More specifically, we assume that at each quantile level a 
sparse set of covariates are relevant for modeling Y, but allow this set to 
be different at different quantiles, see, for instance. Examples 2 and 3 in 
Section 5. At a given quantile level a (0 < a < 1), we define the set of active 
variables 



T 



Ma = {j : QaiY\X.) functionally depends on Xj}, 

where Qa{Y\'X.) is the rth conditional quantile of y given X = {Xi, . . . ,Xp) 
that is, Qa(Y\X.) = inf{y : P(Y < ?/|X) > q}. Let Sa = \Ma\ be the cardinal- 
ity of Ma- Throughout this paper, we assume Sq, < a < 1, is smaller than 
the sample size n. 

In practice, we may consider several quantiles to explore the sparsity 
pattern and the effects of the covariates at different parts of the conditional 
distribution. We refer to Koenker (2005) for a comprehensive introduction 
to quantile regression. 

In ultrahigh dimensional data analysis, there generally exists little prior 
information for specifying a statistical model. Given a large number of co- 
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variates, it is hard to determine which covariates have Unear effects and 
which have nonUnear effects. In our framework besides the sparsity assump- 
tion, we do not impose a specific model structure but ahow the covariate 
effects to be nonlinear. 

2.2. Ranking by marginal quantile utility. Let {(X.i,Yi),i = 1, . . . ,n} be 
i.i.d. copies of (X,y), where Xj = {Xn, . . . ,Xip)'^. Note that 

y and Xj are independent <^ QaiY\Xj) - Q^iY) = VaG(0, 1), 

where Qa{Y\Xj) is the ath conditional quantile of Y given Xj and QaiY) 
is the ath unconditional quantile of y. To estimate the effect of Xj on Y, 
we consider the marginal quantile regression of Y on Xj. Let fj{Xj) = 
argminjE[/9Q,(y — f{Xj)) — pa{Y)], where the inclusion of Pa{Y) makes 
the expectation well defined even when Y has no finite moment, where 
Paiu) = u{a — I{u < 0)} is the quantile loss function (or check function). It 
is known that fj(Xj) = Qa(Y\Xj), the ath conditional quantile of Y given 
X,. 

Without loss of generality, we assume that each Xj takes values on the 
interval [0,1]. Let F be the class of functions defined in condition (CI) in 
Section 3.1. Let = sq < si < • • • < s^ = 1 be a partition of the interval. Using 
the Si as knots, we construct N = k + l normalized i?-spline basis functions 
of order / + 1 which form a basis for F. We write these basis functions as 
a vector 7r(t) = {Bi{t), . . . , i?Ar(t))-^, where ||-Ba:(")IIoo < 1 and || • ||oo denotes 
the sup norm. Assume that fj{t) £ F. Then fj{t) can be well approximated 
by a linear combination of the basis functions 7r(i) /3, for some (3 G M . 

Let 0j = argmin^gjjiv Y17=i PaO^i ~ '^i^ij)'^P)i and define 

l,{t) = 7T{tf^j-Fyl{a), 

where Fy^(a) is the ath sample quantile function based on Yi, ... ,Yn. Thus 

fnj{t) is a nonparametric estimator of Qa{Y\Xj) — QaiY). We expect /„j 
to be close to zero if Xj is independent of Y. 

The independence screening is based on J;he magnitude of the estimated 
marginal components WfnjWn — ^~^'l27=ifnj{Xij)'^. More specifically, we 
will select the subset of variables 

Ma = {l<3<p:\\fnj\\l>l^n}, 

where i^n is a predefined threshold value. In practice, we often rank the 
features by ||/nj||n and keep the top [n/log(n)] features, where [a] denotes 
the integer part of a. 

3. Theoretical properties. 

3.1. Preliminaries. We impose the following regularity conditions to fa- 
cilitate our technical derivations. 
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(CI) The conditional quantile function Qa(Y\Xj) belongs to F, where 
F is the class of functions defined on [0, 1] whose Zth derivative satisfies a 
Lipschitz condition of order c: |/^'^(s) — f^'''{t)\ < co|s — fl*^, for some positive 
constant cq, s,t £ [0, 1], where I is a nonnegative integer and c G (0, 1] satisfies 
d = / + c>0.5. 

(C2) mmj^M^E{Q^{Y\Xj) - Qa{Y)f > cm"^ for some < r < ^J^ 
and some positive constant ci. 

(C3) The conditional density fyix {t) is bounded away from and oo on 
[Qa{Y\Xj) -C,Qa{Y\Xj) + ^], for some ^ > 0, uniformly in Xj. 

(C4) The marginal density function gj of Xj, 1 < j < p, are uniformly 
bounded away from and oo. 

(C5) The number of basis functions N satisfies N~ nT' = o(l) and Nn?'^~^ - 
o(l) as n — )• oo. 

Condition (CI) assumes that the conditional quantile function Qa(Y\Xj) 
belongs to a class of smooth functions. This condition is standard for non- 
parametric spline approximation. Condition (C2) assumes that the features 
in the active set at quantile level a have strong enough marginal signals; a 
smaller r corresponds to a stronger marginal signal. This condition is im- 
portant as it guarantees that marginal utilities carries information about 
the features in the active set. Condition (C3) is a standard condition on 
random errors in the theory for quantile regression. It relaxes the usual sub- 
Gaussian assumptions that are needed in the literature on high dimensional 
inference. Condition (C4) is similar as condition (B) of Fan, Feng and Song 
(2011). Note that (C4) is not restrictive when Xj is supported on a bounded 
interval, say [0, 1]. When the distribution of Xj has an unbounded support 
(e.g., normal), we can view Xj as coming from a truncated distribution. In 
fact, if there is an outlier in Xj as in the case of a heavy-tailed distribution, 
we do better by dropping the outlier or transforming Xj to be uniformly 
distributed on [0, 1]. In our numerical simulations of Section 5, the normally 
distributed covariates are scaled to the interval [0, 1] and the results would 
change little if any reasonable truncation is used instead. Condition (C5) 
describes how fast the number of basis functions is allowed to grow with the 
sample size. 

Given (y, X), where X = {Xi, . . . ,Xp)^ , we define 

(3.1) /3oj = argminE[p„(y - iT{X,f (3) - p,(y)]. 

Let fnj{t) = '^{'t)'^Poj — Qa{y), whose sample version was defined in Sec- 
tion 2. Furthermore, we let ||/nj|P = E[fnj{Xj)'^]. The following lemma 
shows that the spline approximation error is negligible in the sense that 
the spline approximation carries the same level of information about the 
marginal signal. 
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Lemma 3.1. Under condition (C5), uiiuj^Ma ll/nj|P > cin~'^/8, for all 
n sufficiently large. 

3.2. Sure screening property. As covariate screening often serves only as 
the first step for high dimensional data analysis, the most important prop- 
erty as far as practical application is concerned is the sure screening property. 
In the quantile-adaptive framework, we require that the sure screening prop- 
erty holds at each quantile level a, that is, the set of selected covariates at 
quantile level a includes M^ with probability tending to one. 

The key step of deriving the sure screening property is to establish expo- 
nential probability bounds for \\Pj — Pqj\\ and ||/nj||n ~ ll/njlP- The main 

technical challenge is that (3j is defined by minimizing a nonsmooth objec- 
tive function, thus does not have a closed-form expression. The exponential 
bounds are summarized in the following lemma. 

Lemma 3.2. Assume conditions (C1)-(C5) are satisfied. 

(1) For any C > 0, there exist positive constants C2 and C3 such that 

P( max 0j - pQjW > CN^^^ii'A 

< 2pexp(-C2n^-''^) + pexp{-C3N~^n^~^'') 

for all n sufficiently large. 

(2) For any C > 0, there exist some positive constants 5i and 62 such that 

p(^ma^J\\lj\\l-\\fnjf\>Cn~^) 

<p{llexp(-(5in^~^^) + 12iV2exp(-(52iV~^n^~2^)}, 
for all n sufficiently large. 

Remark. The results suggest that we can handle the dimensionality 
logp = o{n^~^'^ + N~'^n^~'^'^). This dimensionality depends on the number 
of basis functions N and the strength of the marginal signals. If we take 
7Y = jii/(2a+i) (^the optimal rate for spline approximation), then for r < 
min(l/4, (d— l)/{2d + 1)), we can handle ultra-high dimensionality, that is, 
p can grow at the exponential rate. 

The following theorem establishes the sure screening property. 

Theorem 3.3 (Sure screening property). Under the conditions of Lem- 
ma 3.2, if T < 1/4, N^n?'^~^ = o(l), and we take the threshold value Vn = 
6*n~'^ with 6* < ci/16 for the constant c\ specified in Lemma 3.1, then 

P{Ma C Ma) > 1 - 5a{llexp(-(^in^-^^) + 12N^ exp{-62N~'\^~'^^)}, 

for all n sufficiently large. Especially, P{Ma C Ma) —5-1 as n —t- 00. 
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3.3. Controlling false discovery. An interesting question is how many 
variables are retained after the screening. A simple bound is provided below, 
which extends the results in Fan, Feng and Song (2011). 

Let n = (7r(Xi ),..., 7r(Xp))^ and S = E(nn^). Let || • \\f denote the 
Frobenius matrix norm and let || • ||2 denote the spectral matrix norm. Note 
that 

E ll/«^-|l' = EE(7r(X,-)/3o,)' < X; Amax(E7r(X,-)7r(X,-f )||/3o,f 



< N^tTa.ce{En{Xj)7v{Xjf) < NE 

< iVE(||n||2,) < N^E{\\U\\l) = iV2A^ax(5]), 



p N 

,j=i k=i 



where the second inequality uses the result ||/3oj|P < cN for some posi- 
tive constant c (proved in the supplemental material [He, Wang and Hong 
(2013)]). 

For any e > 0, we define the set 

Dn = { max^|||/„j||2 - \\fnjf\ < en~^y 

Then on Z)„, the cardinality of {j : \\fnj\\n > 2en~'^} cannot exceed the car- 
dinality of {j : ll/njiP > eJ^"^}, which is bounded by e~^A^^n'^Amax(^)- 

For i/„ = 5*n~'', we take e = 5*/2, then P(|M„| < e'^N'^n'^Xn.s.^i'E)) > 
P{Dn). Applying Lemma 3.2(2), we obtain the following theorem which 
provides a bound on the size of selected variables. 

Theorem 3.4. Under the conditions of Theorem 3.3, there exist some 
positive constants 5i and 82 such that for all n sufficiently large, 

> 1 -p{llexp(-5in^-''^) + l2N'^exp{-52N~^n^''^'')}. 
Especially, P(|M„| < IN'^n^Xms.A'^)/^*) "^1 as n^ 00. 

The above theorem suggests that if Amax(S) = 0{n'^) for some 7 > 0, 
then the model obtained after screening is of polynomial size with high 
probability. Similar observation has been reported for the -L2-based screening 
procedure of Fan, Feng and Song (2011). 

4. Quantile-adaptive screening in survival analysis. There exists very 
limited amount of work on feature screening with censored responses. Fan, 
Feng and Wu (2010) and Zhao and Li (2012) investigated marginal screening 
based on the Cox proportional hazards model. As a powerful alternative to 
the classical Cox model, quantile regression has recently emerged as a useful 
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tool for analyzing censored data, see Ying, Jung and Wei (1995), McKeague, 
Subramanian and Sun (2001), Portnoy (2003), Peng and Huang (2008), 
Wang and Wang (2009), among others. The quantile regression approach 
directly models the survival time and is easy to interpret. Furthermore, it 
relaxes the proportional hazards assumption of the Cox model and can natu- 
rally accommodate heterogeneity in the data. The quantile regression based 
screening procedure can be naturally extended to survival analysis. 

Assume that Yi is subject to random right censoring. Instead of {(Xj, Yi), 
z = 1, . . . ,n}, we observe {(X.i,Y*,6i),i = 1, . . . ,n} where 

(4.1) Y* = min(y„ Q), 6., = I{Yi < d). 

The random variable Cj, called the censoring variable, is assumed to be 
conditionally independent of Yi given Xj. In this section, we assume that 
the censoring distribution is the same for all covariates, but this assumption 
will be relaxed in Section 7.1. Let G{t) = P{Ci > t) be the survival function 
of Cj. Let G{t) be the Kaplan-Meier estimator of G(t), based on {Y*,5i\, 
i = 1, . . . ,n. 

Similarly as in the case of complete data, we consider independence screen- 
ing based on nonparametric marginal regression given Xj . More specifically, 
we consider inverse probability weighted marginal quantile regression esti- 
mator 

n f- 

p'j = avgmm^ ' pa{Y* - 7v{Xij)'^P). 
i=i GiYi ) 



Let fnj{t) = ■^(*)"^/3j — -^KM n('^) where -F'km n('^) ^^ *^^ nonparametric esti- 
mator of the ath conditional quantile of Y based on {Y*,Ci,6i), i = 1, . . . , n. 
The estimator we use here is the inverse function (the left-continuous ver- 
sion) of the Kaplan-Meier estimator of the distribution function of Y, whose 
properties have been studied in Lo and Singh (1986). We will select the sub- 
set of variables M^ = {1 < i < p: |l/nilln ^ ^nl' where v!^ is a predefined 
threshold value. As for the complete data case, in practice we often rank the 
features by ||/^j||^ and keep the top [n/log{n)] features. 

For the random censoring case, in addition to conditions (C1)-(C5), we 
assume that: 

(C6) P{t <Yi< Ci) > To > for some positive constant tq and any t G 
[0,r], where T denotes the maximum follow-up time. Furthermore, sup{t : 
P{Y > t) > 0} > sup{t : P{C > t) > 0}. The survival function of the censor- 
ing variable G{t) has uniformly bounded first derivative. 

(C7) There exist < /3i < /32 < 1 such that a G [/3i , j32] and that the dis- 
tribution function of Yi is twice differentiable in [Qp^iYi) — E,Qp^{Yi) -\- e] 
for some < e < 1, with the first derivative bounded away from zero and 
the second derivative bounded in absolute value. 
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Condition (C6) and (C7) are common in the survival analysis literature 
to ensure that the Kaplan-Meier estimator and its inverse function are well 
behaved. The following theorem states that the sure screening property holds 
for the random censoring case and an upper bound that controls the size of 
selected variables can be obtained. 

Theorem 4.1. Assume conditions (C1)-(C7) are satisfied, r < 1/4 and 
jy3^2T~i _ 0(1)^ if y;g fdf^g ifi^ threshold value u^ = 6*n~'^ with S* < ci/16, 
then: 

(1) there exist positive constants 83 and 64 such that 

P{Ma C M^) > 1 - 5„{17exp(-(53n^-^^) + 12N^ exp{-5iN-^7i^-^^)}, 

for all n sufficiently large. Especially, P{Ma C M^) — )■ 1 as n^ 00. 

(2) for all n sufficiently large, 

P(|K|<2Ar2n^A„,ax(S)/r) 

> 1 -p{17exp(-67n^-^^) + 12N^ eicp{-hfiN~^n^~'^^)]. 

Especially, P{\M^\ < 2iV2n^Amax(5])/(5*) ^1 as n^ 00. 

5. Monte Carlo studies. We carry out simulation studies to investigate 
the performance of the proposed quantile adaptive sure independence screen- 
ing procedure (to be denoted by QaSIS). We consider two criteria for evalu- 
ating the performance as in Zhu et al. (2011). The first criterion is the min- 
imum model size (denoted by TZ), that is, the smallest number of covariates 
that we need to include to ensure that all the active variables are selected. 
The second criterion is the proportion of active variables (denoted by S) 
selected by the screening procedure when the threshold f^ = [n/log(n)] is 
adopted. Note that the first criterion does not need to specify a threshold. 
An effective variable screening procedure is expected to have the value of 7^ 
reasonably small comparing to the number of active variables and the value 
of S close to one. 

We first consider the complete data case and compare the performance of 
QaSIS with the nonparametric independence screening (NTS) procedure of 
Fan, Feng and Song (2011) and the sure independent ranking and screening 
(SIRS) procedure of Zhu et al. (2011). In computing QaSIS and NIS, the 
number of basis (dn) is set to be [n^' ^] = 3. For each example, we report the 
results based on 500 simulation runs. 

Example 1 (Additive model, n = 400, p = 1000). This example is adapted 
from Fan, Feng and Song (2011). Let gi{x) = x, g2{x) = (2x — 1)^, g2,{x) = 
sin(27rx)/(2 — sin(27rx)), and 54 (x) = 0.1sin(27r2;) + 0.2cos(27rx) + 
0.3sin(27rx)^ + 0.4cos(27rx)'^ + 0.5sin(27rx)^. The following cases are studied: 
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• Case (la): Y = bgi{Xi) + 3(72(^2) + ^g-i{X:i) + 654(^4) + Vl^e, where 
the vector of covariates X = (Xi, . . . ,Xiooo)^ is generated from the mul- 
tivariate normal distribution with mean and the covariance matrix 
^ = (o-jj)ioooxiooo with an = 1 and aij = pi*"-'! for ii^ j, e ~ iV(0, 1) is 
independent of X. In case (la), we consider p = 0. 

• Case (lb): same as case (la) except that p = 0.8. 

• Case (Ic): Same as case (lb) except that e has the Cauchy distribution. 

Note that the models are homoscedastic in Example 1, thus the number of 
active variables are the same across different quantiles. 

Example 2 (Index model, n = 200, p = 2000). This example is adapted 
from Zhu et al. (2011). The random data are generated from Y = 2(Xi + 
0.8X2 + 0.6X3 + 0.4X4 + 0.2X5) + exp(X2o + X2i+X22)-e, where e~Af (0,1), 
X = (Xi,X2, . . . ,-'^2000) follows the multivariate normal distribution with 
the correlation structure described in case (lb). Different from the regression 
models in Example 1, this model is heteroscedastic: the number of active 
variables is 5 at the median but 8 elsewhere. 

Example 3 (A more complex structure, n = 400, p = 5000). We consider 
a more complex heteroscedastic model for which the conditional distribution 
of Y does not have a simple additive or index structure. 

. Case (3a): y = 2(X2+X|)+exp((Xi + X2 + Xi8 + Xi9 + --- + X3o)/10)-e, 
where e ~ A^(0, 1), and X = (Xi, X2, . . . , ^5000)"^ follows the multivariate 
normal distribution with the correlation structure described in case (lb). 
In this case, the number of active variables is 2 at the median but is 15 
elsewhere. 

• Case (3b): same as case (3a), but with 2[Xl + X'2) replaced by 2((Xi + 

1)2 + (^2 + 2)2). 

The median value of TZ (with IRQ in the parenthesis) and the average 
value of S for QaSIS, NIS and SIRS are summarized in Table 1. For Qa- 
SIS, we report results for two quantiles a = 0.5 and 0.75. We observe the 
following from Table 1: (i) The L2 norm based NIS procedure exhibits the 
best performance when the random error has a normal distribution, but its 
performance deteriorates substantially for heavy-tailed or heteroscedasitic 
errors (Examples 1-2). (ii) We observe that in case (la) where p = 0, no 
method works really well in terms of the minimum model size. This is be- 
cause the independent signals work against the marginal effect estimation 
as accumulated noise, thus masking the relatively weak signals from X-^ and 
X4 in this model, (iii) In Example 3 where the model has a more complex 
structure, QaSIS is effective in identifying the number of active variables at 
different quantiles; while the performance of SIRS depends on the functional 
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Table 1 

Results for Examples 1-3. The numbers reported are the median of TZ [with interquartile 

range (IQR) given in parentheses] and S 



Example 



Case 



Method 



P 



7?.(IQR) 



Example 1 



(la) 



(lb) 



(Ic) 



Example 2 



Example 3 



(3a) 



(3b) 



QaSIS (q = 0.50) 


4 


655 (434) 


0.56 


QaSIS (a = 0.75) 


4 


652 (398) 


0.56 


NIS 


4 


660 (415) 


0.55 


SIRS 


4 


689 (365) 


0.53 


QaSIS (a = 0.50) 


4 


4(0) 


1.00 


QaSIS (a = 0.75) 


4 


4(0) 


1.00 


NIS 


4 


4(0) 


1.00 


SIRS 


4 


6(9) 


0.99 


QaSIS(Q = 0.50) 


4 


4(0) 


1.00 


QaSIS(Q = 0.75) 


4 


4(0) 


1.00 


NIS 


4 


6(79) 


0.83 


SIRS 


4 


7(14) 


0.98 


QaSIS [a = 0.50) 


5 


6(2) 


1.00 


QaSIS (a = 0.75) 


8 


18 (24) 


0.96 


NIS 


8 


1726 (511) 


0.22 


SIRS 


8 


18 (16) 


0.97 


QaSIS (a = 0.50) 


2 


3(2) 


1.00 


QaSIS {a = 0.75) 


15 


153 (207) 


0.89 


NIS 


15 


3117 ( 4071) 


0.50 


SIRS 


15 


698 (1140) 


0.89 


QaSIS (a = 0.50) 


2 


2(1) 


1.00 


QaSIS {a = 0.75) 


15 


88 (542) 


0.92 


NIS 


15 


4166 (1173) 


0.25 


SIRS 


ir, 


29 (21) 


1.00 



p": the number of truly active variables. 

form. Overall, our simulations for the complete data case demonstrate that 
the performance of QaSIS is on par with or better than that of NIS and 
SIRS for a variety of distributions of covariates and errors. 

Variable screening with censored responses has received little attention 
in the literature. In Example 4 below, we compare the quantile-adaptive 
nonparametric marginal screening procedure proposed in Section 4 with the 
Cox model based marginal screening procedure [Cox(SIS)] of Fan, Feng and 
Wu (2010) and a naive procedure treating the censored data as complete 
and then applying to the QaSIS procedure (denoted by Naive). 



Example 4 (Censored responses). We consider a case in which the la- 
tent response variable Yi is generated using the same setup as in case (lb). 
Let Y^ =min(5^,Cj), where the censoring time Cj is generated from a 3- 



4(2) 


0.99 


5(22) 


0.96 


4(6) 


0.98 


5(13) 


0.98 


254 (497) 


0.74 


792 (351) 


0.14 


190 (655) 


0.70 
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Table 2 

Simulation results for Example 4 (n = 400, p — 1000^. The numbers reported are the 

median of TZ [with interquartile range (IQR) given in parentheses] and S 

Method p* 7?.(IQR) S 

QaSIS (a = 0.50) 4 

QaSIS (a = 0.25) 4 

LQaSIS (q = 0.50) 4 

LQaSIS (q = 0.25) 4 

Naive (a = 0.50) 4 

Naive (a = 0.25) 4 

Cox (SIS) 4 

p*: the number of truly active variables. 

component normal mixture distribution 0.4A^(— 5,4) + 0.1A^(5,1) + 
0.5A^(55, 1). The censoring probability is about 45%. Due to the high censor- 
ing rate, the performance of the variable screening procedures is investigated 
at the median and the 0.25 quantile. 

Table 2 summarizes the simulations results based on 100 runs. QaSIS 
substantially outperforms both Naive and Cox(SIS). Under-performance of 
Cox(SIS) can be attributed to the fact that the proportional hazards as- 
sumption is not satisfied in this example. Table 2 also includes the LQaSIS 
procedure which will be discussed in Section 7.1. 

6. Real data analysis. We illustrate the proposed screening method on 
the diffuse large-B-cell lymphoma (DLBCL) microarray data of Rosenwald 
et al. (2002). The data set contains the survival times of 240 patients and the 
gene expression measurements of 7399 genes for each patient. The gene ex- 
pression measurements for each gene are standardized to have mean zero 
and variance one. To assess the predictive performance of the proposed 
method, we divide the data set into a training set with ni = 160 patients 
and a testing set with remaining re2 = 80 patients, in the same way as Bair 
and Tibshirani (2004) did. The index of the training set is available from 
http: //www- Stat . stanford.edu/~tibs/superpc/staudt .html. 

Nearly half of the survival time data are censored, so we focus our atten- 
tion on two quantile levels a = 0.2 and 0.4 that represent the effects of gene 
expression on the sub-population of patients with poor prognosis. We apply 
the proposed QaSIS method to the training data to select [rei/log(ni)] = 31 
genes, which is followed up by a variable selection procedure based on addi- 
tive quantile regression with the SCAD penalty [Fan and Li (2001)] to find 
two top genes. As with the empirical studies in the simulation study, we use 
three internal knots of A^ = 3. Because almost all the censoring occurs above 
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Table 3 
Estimated slope coefficients (and p-values) for survival time 
versus the risk score at ath quantile obtained by QaSIS and 

SIS (Cox) 

Method Estimated coefHcients p-value 

QaSIS (a = 0.4) 0.93 0.02 

QaSIS (a = 0.2) 0.53 0.04 

SIS (Cox) (q = 0.4) 0.16 0.62 

SIS (Cox) (a = 0.2) 0.17 0.62 



the 0.4th quantile, we do not need to re-weight the censored observations 
for the low quantiles we are considering in this example. Based on the two 
selected genes at each a, we estimate the corresponding quantile function. 
The estimated quantile function from the training set is then used to calcu- 
late risk scores for each patient in the testing data set. If Yi is the survival 
time of the ith patient in the training set, with Sj as the predicted risk score, 
we expect the ath quantile of Y^ given Sj to have a significant (and positive) 
slope. 

For the purpose of comparison, we also follow the same analysis path but 
replace QaSIS by the sure independence screening for Cox models, SIS(Cox), 
of Fan, Feng and Wu (2010) and the SCAD-penalized Cox regression to select 
two genes. The risk scores are then calculated based on the linear index 
for the Cox model. Table 3 summarizes the slope coefficients of regressing 
survival times on risk scores in the training set based on the censored quantile 
regression of Portnoy (2003). It is clear that the analysis based on QaSIS has 
the desired predictive power, where the 0.2 and 0.4 quantiles of survival time 
for the testing data set are significantly associated with the predicted risk 
scores, but the analysis based on SIS (Cox) did not make it. If we regress the 
survival time on the risk scores on the training data, we would get coefficients 
of exactly 1.0 under QaSIS, but it would not have validation power. 

We examined the two genes selected by QaSIS. At a = 0.4, their GenelDs 
are 31981 (known as AA262133, septin 1) and 17902 (AA284323, glutathione 
synthetase). Both genes belong to the known Proliferation signature group 
in the study of DLBCL by Rosenwald et al. (2002). Gene AA262133 was 
also ranked very high by Li and Luan (2005) using the partial likelihood- 
based scores. At a = 0.2, the two selected genes have IDs 31585 (known as 
NM 018518, MCMIO minichromosome maintenance deficient 10) and 33014 
(AA769543, Hypothetical protein MGC4189). We find that Gene 31858 also 
belongs to the known Proliferation signature group. Gene 33014 was identi- 
fied as an interesting candidate by Li and Luan (2005). We did not find any 
signature group associated with it, but it seems quite related to the lower 
tail of the survival time distribution, and is worth further investigation. 
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7. Discussions. 

7.1. Further extension on screening with censored responses. The as- 
sumption that the censormg distribution does not depend on the covari- 
ates is popular in regression analysis of survival data. It can be further 
relaxed. Assume that Yi and Ci are conditionally independent given Xij, 
j = 1, . . . ,p. Let G{t\Xij) = P{Ci > t\Xij) be the conditional survival func- 
tion of Ci given Xij. Let G{t\Xij) be the local Kaplan-Meier estimator of 
G{t\Xij) [e.g., Beran (1981) and Gonzalez-Manteiga and Cadarso-Suarez 
(1994)]. More specifically, 



''■'' ^^^■-^^j5{^- ELrAg^>^)i^..(j ^ 



where Bnk{x) = i^(2^)/{E"=i K{^)}, k = 1,. . . ,n, are the Nadaraya- 
Watson weights, /i„ is the bandwidth and K(-) is a density function. We 
consider estimating Pqj using the locally weighted censored quantile regres- 
sion, that is. 



Y^ "« „ f\^* —tv \T , 



pj = argmin^^ p^{Y* - 7v{Xij) /3). 

"' " i=l '^{^i \^ij) 



Let f^^it) = TTitffB^ - F-l,Ja) and define M^ = {l<j<p: \\f^^\\l > rj'^} 
where ry^ is a predefined threshold value. We refer to this new procedure as 
LQaSIS, whose numerical performance is reported in Table 2 and two other 
examples in the supplemental inaterial [He, Wang and Hong (2013)]. 
We assume, instead of (C6): 

(C6') mixP{t ^Yi < Ci\x) > tq > for some positive constant tq and 
any t € [0,r], where T denotes the maximum follow-up time. G{t\x) has 
first derivatives with respect to t, which is uniformly bounded away from 
infinity; and G{t\x) has bounded (uniformly in t) second-order partial deriva- 
tives with respect to x. Furthermore, to < sup{t : G{t\x) > 0} <ti uniformly 
in X for some positive constants to a-nd ti, and sup{t:P(y > t\x) > 0} > 
sup{t : G{t\x) > 0} almost surely for x. 

Then Theorem 4.1 can be extended as follows. 

Theorem 7.1. Assume conditions (C1)-(C5), (C6') and (C7) are sat- 
isfied, T < 1/4, n/i^->oo, iV^n^^~^ =o(l), iV^n^^~^(logn)^/i~^ = o(l) and 
(iV + n'^)n'^h'^ = o(l). // we take Vn = 5*n~'^ with 5* < ci/16, then: 

(1) there exist positive constants 83 and 64 such that 

P{Ma C M^) > 1 - 5„{17exp(-(53n^-^^) + 12Af2exp(-54iV"^i^"^^)}, 

for all n sufficiently large. Especially, P{Ma C M^) — )• 1 as n —)• 00. 
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(2) 

P(|K|<2iV2n^A„,ax(S)/5*) 

for all n sufficiently large. Especially, P{\M^\ < 2A^^n'^Aniax(^)/^*) — >• 1 as 
n — )• oo. 

The proof of Theorem 7.1 is given in the supplemental material [He, Wang 
and Hong (2013)]. 

7.2. Limitations and other issues. We have not investigated the prob- 
lem of adaptively selecting the number of basis functions in this paper for 
two reasons: (1) although adaptive tuning is possible, it will significantly 
increase the computational time as it needs to be done for each covariate 
separately; (2) optimal estimation is not the goal for marginal screening, in- 
stead consistent estimation generally suffices. Empirically, we find that 3 or 
4 internal knots are generally enough to flexibly approximate many smooth 
functions typically seen in practice. 

Marginally unimportant but jointly important variables may not be pre- 
served in marginal screening. This is a well-recognized weakness of all exist- 
ing marginal screening procedures. Iterative procedures may be helpful to a 
certain degree for this problem [Fan and Lv (2008)]. In the same spirit, we 
find that in practice a slightly modified QaSIS helps in situations where a 
dominating variable increases the error variance of the marginal regression 
model for other variables and hence mask the significance of other variables. 
If the top ranked variable is dominating, then the modified QaSIS removes 
its effects from Y first and screen the remaining variables again. 

The way we define the set of active variables can be considered as a 
nonparametric approach in the sense that we consider directly the condi- 
tional quantile function without a specific model structure. In real life high- 
dimensional data analysis, the knowledge needed for an appropriate model 
specification is often inadequate. Using a misspecified model to perform vari- 
able selection is likely to produce misleading results. We propose to separate 
variable screening and model building, where a nonparametric approach is 
applied to screen high-dimensional variables and then followed by sensible 
model building in the second stage in a lower-dimensional space. We expect 
that this model-free approach to variable screening to gain momentum in 
ultra-high dimensional learning, see, for example, the work of Li, Zhong and 
Zhu (2012) on distance correlation based screening. 

8. Technical proofs. We present the proof for the random censoring case, 
as this is the more challenging scenario. The proof for the complete data case 
and that for Lemma 3.1 are given in the supplemental material [He, Wang 
and Hong (2013)]. 
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To establish the sure independence property, a key step is to obtain an 
exponential tail probability bound for 

(8.1) P( maxj||/;^^.||^ - ||/„^.||2| >Cn--), 

where C is any positive constant. We have 

||/;;^.|t^ = 3f(P„7r(X,)7r(X,)^)3;-2FKiJa)n-i(P„7r(X,))^3; 

+ (^KM,n("))'> 

where P„7r(Xj)7r(Xj)^ = n-^y:'l^^7r{Xij)7r{Xijf , Fn7v{Xj) 

Yl^=i '^{^ij)i &^d FiTv{Xj)7v{Xj) denotes the expectation of 7v{Xj)7t{X 

under the true distribution of Xj . Note that 

llfc ||2 _ II f i|2 
WJnjWn WJnjW 



n ^ X 



(/3, - (3,jy {FrMX.MXjY )(/3, - (3o,) 
+ 2(3 ■ - Po^f{¥n7v{X,)7v{X,f)f3^ 



Oj 

+ (3^j{¥n7v{Xj)7v{Xjy' - E7v{Xj)7v{Xjy)poj 
- '^F^uJaWnT^iX.fP'j - E7viXjf(3,^] 
+ 2[Q^iY)-F^l,Ja)]{E7v{Xjfp,^) 

+ [{FKL,ni^)f-iQa{Y)f] 

6 

= / v'^jfc' 
fc=l 

where the definition of Sjk is clear from the context. From the argument of 
Lemma 3.1, E{tv{XjY' (3qj) is uniformly bounded in Xj and by Lemma 8.4(4) 

below, we have \Sj^\ = 0(n~^'^(logn)-^'^) = o{n~'^) almost surely. Similarly, 
\Sjq\ = 0(n~"^'^(logn)^'^) = o{n~'^) almost surely. Therefore, for all n suffi- 
ciently large, 

"'C^lll^nilln-lUn, 



P(max|||/;;^.||2-||/,^.||2|>Cn-- 

I max V|S,fc|>Cn~^/2 I 



<P\ 



<yP( max \S^k\>Cn~^i 
^-^ \i<i<p ■' 



k=i '^^^^ 
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In the following, we shall provide details on deriving exponential tail bound 
for P(maxi<j<p |5jfc| > Cn~'^/8). 

8.1. Properties of the spline basis. First, we recall some useful properties 
of the basis vector 7r(t) = (5i(t), . . . ,5Ar(t))^. Zhou, Shen and Wolfe (1998) 
established that there exist two positive constants 6i and 62 such that 

(8.2) 

< b2N-^ Vj, 

where Amin and Amax denote the smallest eigenvalue and the largest eigen- 
value, respectively. 

Stone (1985) established that there exists a positive constant 63 such that 

(8.3) E{Bl{Xij))<b3N-'^, l<k<N,l<i<n,l<j<p. 
Similar result can be found in He and Shi (1996). 

Lemma 8.1. Let Fn7v{Xj)7v{Xj)'^ = n'^ Y17=i 7^{Xij)7v{Xij)'^ and Bj = 
¥n7r{X,)7v{X,f - E7viXj)7riXjf . 

(1) There exists a positive constant 04 such that for all n sufficiently large 

(8.4) P{X^^,{Fn7T{Xj)7T{Xjf) > (62 + l)iV"') < 2N^eM-C4nN-^), 

(2) For any C5 > 0, there exists a positive constant cq such that for all n 
sufficiently large 

P(max(|Ai„ax(Dj)|,|Ainin(Dj)|)>C5iV-in-") 
(8.5) 

< 2N'^ expi-ceN-^n^-^^). 

Proof. The proof is an extension of that for Lemma 5 in Fan, Feng 
and Song (2011) which proved similar results for the smallest eigenvalue. 
First, for any two symmetric matrices A and B, we have Amax(A + B) < 

Amax(A) + Amax(B). This implies that Amax(A) - Amax(B) < Amax(A - B) 

and Amax(B) - Amax (A) < Aniax(B - A). Thus 

I Amax (A) - Amax (B) I < max{|Amax(A - B)|, |Amax(B - A)|}. 

Applying the above inequality, we have 

\Xrn.A^n7v{Xj)7T{Xjf) - X^,^{E7T{Xj)7T{Xjf)\ 

(8.6) 

< max{|Amax(Di)|, |Amax(-Dj)|}. 

For any A^-dimensional vector a = (ai, . . . , a^)'^ satisfying ||a|| = 1, we have 

|a-^Dja| < ||Dj||oo(X^j=i l^«l)^ — -^llDjUoo, where ||Dj||oo is the sup norm of 
the matrix Dj. Thus Amax(Dj) = max||a||^]^a'^Dja < A^||Dj||oo- Also 
Amax(Dj) = -min||a||=i(-a^Dj-a) > -7V||Dj||oo. Thus |Amax(Dj)| < iV||Dj||oo. 
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Similarly, we have |Amax(— Dj)| < iV||Dj||oo- Following (8.6) and using the 
result on the smallest eigenvalue of Dj [Fan, Feng and Song (2011)], we have 

|A„,ax(Pn7r(X,-)7r(X,-)^) " Xra.A^7v{Xj)7v{Xjf)\ 
(8.7) 

<max(|Amax(Dj)U-^min(Dj)l) <^l|Dj||oo- 

As in Fan, Feng and Song (2011), applying Bernstein's inequality to each 
entry of Dj, it can be shown that V5 > 0, 

(8.8) F(A.||D,||. > NS/n) < 2iV^exp{-^^^-^^l^^}. 

To prove (8.4), we use the bound in (8.2), apply the inequality in (8.7) 
and take S = N~'^n in (8.8). This gives 

P{Xm.A^n7viXj)7viXjf) > (62 + l)N-^) < 2N^ exp{-CiN-''n), 

for some positive constant C4 for all n sufficiently large. 

To prove (8.5), we apply the inequality in (8.7) and take 6 = c^N^'^n^^'^ 
in (8.8). This gives 

P(max(|A,nax(D,)|,|A,nin(D,)|)>C5iV-^n-^)<2A^2g^p(_^^^-3^1-2r)^ 

for some positive constant cq for all n sufficiently large. D 

— C 

8.2. An exponential tail probability bound for ||/3- — Pqj\\- Let 

n 

Bn{l3) = n-^Y.^,[G{Y*)r^[p^{Y* - n{X,,f(3) - p^{Y*)], 
1=1 

Bid) = E{S,[G{ynr\PaiY* - 7T{X,,fP) - PaiY*)]}. 
— c 

Then /3j = argmin^gjgjv Bn{(3)- Applying the iterative expectation formula, 
we have 

B{p) = E{E{/(y, < a^[G{Y,)r\pa{Yi - 7v{Xijf(3) - pa{Yi)]\Y„ X^j}} 

= EK(y, - 7v{x,,fp) - p„(y,)]- 

Hence, Pqj = argmin^gj-iv B{P). 

We can bound the difference ||/3j — /^ojll by the difference of their respec- 
tive objective functions. 

Lemma 8.2. For any 6 >0, 

P(||3;-/3oill>'^) 
(8.9) 

1 



<P( sup \Bn{(3) - Bm > - inf (BiP) - B{(3,j)) 

/3-/3o,ll<5 ^ ll/3-/3(„||=5 
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Proof. This is a direct application of Lemma 2 of Hjort and Pollard 
(1993) making use of the convexity of the objective function. D 

The lower bound of the right-hand side of (8.9) can be explicitly evaluated 
for any given 6 > 0. This is summarized in the following lemma. 

Lemma 8.3. Let C > be an arbitrary constant. Assume that N~ v7 = 
o(l), then there exists a positive constant 64 such that 

inf {B{(3)-B{(3^.))>bm~^^ 

||/3_/3,y||=CArV2ri,-^ 
for all n sufficiently large. 

Proof. We consider (3 = Pqj + CN^^'^n~'^u, where u G M^ satisfying 
||u|| = 1. Using the identity by Knight [(1998), page 758], we have 

= E{p„(y - 7r{Xjf(3oj - CN'/^n~-7v{X,fn) - p^{Y - 7r{Xjf(3o,)} 
= CN^/'^n~^E{7T{Xjfu[I{Y - 7T{Xjf(3oj < 0) - r]} 

rCN^/'^n-^-K(Xj)Tu 



+ E<| / ^ [/(y-7r(X,)^/3o,<s) 



-/(y-7r(X,)^/3o,<0)]ds 
= C7Ari/2n--E{7r(X,)^u[Fy|^^,(7r(X,)X-) " Fy\xM^M 

+ E|y^ [Fy^x^{7v{Xjff3,^ + s) 

-Fy\jc,{7viX,fP0^)]ds 

By Holder's inequality, we have 

|/i| < C7iVi/2^-^(E(7r(Xj)^u)2)^/^ 

X [E(Fy|^^,(7r(X,)X') - FY\x,{fj{Xj))ff^ 

where the second inequality uses inequality (B.3) in the supplementary ma- 
terial and (8.2). 



\2 
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Furthermore, for some ^ between 7r{Xj)^ f3Qj + s and 7v{Xj)^f3Qj, 

( j-CN^/'^n-^Tz{Xj)'^\i ^ 

= 0(n~2-) 

by (8.2). Note that I2 is nonnegative and Ii = 0(12). Thus, the conclusion 
of the lemma holds. □ 

Lemmas 8.4-8.6 below provide several useful technical results for evalu- 
ating the right-hand side of (8.9). 

Lemma 8.4. Assume conditions (C6) and (C7). The Kaplan-Meier es- 
timator G{t) satisfies: 

(1) supo<t<r|G(t) -G(i)| =0(n~^/2(logn)^/2) almost surely. 

(2) d{t)-^ - G{t)-' = n-iE"=i ^-%W^ + Mt), where aY*,6j,t) are 
independent m,ean zero random variables whose expression is given in The- 
orem 1 of Lo and Singh (1986), and supo<t<r |-Rn(OI = 0(n~^''*(logn)'^'^) 
almost surely. 

(3) supo<t<T Ig^ - Gliyl = 0(n"^/^(log?i)^/2) almost surely. 

(4) sup^^<^<^2 I^KM,n(«) - Qr{Y)\ = ©(^-^/^(log n)!/^) almost surely. 

Proof. The results in (1) and (4) are given in Lemma 3 of Lo and Singh 
(1986). The result in (2) follows from the Taylor expansion, Theorem 1 in 
Lo and Singh (1986) and the result in (1). The proof of (3) follows Taylor 
expansion and (1). D 

Lemma 8.5 (Massart's concentration theorem, 2000). Let Wi,...,Wn 
be independent random variables and let G be a class of functions satisfying 
0'i,g < fi'(M^j) < bi^g for some real numbers Oi^g and hi^g, and for all 1 <i <n 
and g ^ G. Define L^ = su]Pg^QY17=ii^i,g ~ Oi,g)^/n and Z = 
suPggG^~^II]r=i(5'(^i)--E'(5(W^i)))l- Then for any positive t, P{Z >EZ -\- 
t)<exp[-Mi]. 

Lemma 8.6 [Bernstein inequality for [/-statistics, Hoeffding (1963)]. Let 
^nid) denote the second-order U -statistics with kernel function g{ti,t2) based 
on the independent random variables Zi, . . . , Z„. Assume that the function g 
is bounded: a< g <b for some finite constants a and b. If E{g{Zi,Zj)) = 0, 

Mi / j, then \ft > 0, P{\U!^{g)\ > t) < 2exp{-j^^^), where k denotes the 
integer part of n/2 . 

Lemma 8.7. Assume the conditions of Theorem 4-1- Tor any C > 0, 
there exist positive constants cy and cg such that for all n sufficiently 
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large 

P(||3j -/3oj|| > CiV^/2^"^) < 4exp(-C7ni-^^) + exp(-C8iV-V-2^). 

Proof. Following Lemmas 8.2 and 8.3, there exists some 64 > such 
that for all n sufficiently large, 



P{\\p]-(3,,\\>CN^'^n-^) 



<P( sup \Bn{l3)-B{P)\>bin-^^ 

\\/3~/3oj\\<CN^/^n-^ 



<p{\Bn{l3oj)-Bi(3,^)\>-b,n^^^ 



1. 
-( 
2 

1, 



+ P( sup \Bn{P) - BniPoj) - B{I3) + i?(/3o,)| > ^hm~^- 

= Jl + J2. 

First, we evaluate Ji. Let Wi = 6i[G{Y*)]-^[pa{Y* -7T{Xijf(3oj)-paiY*)]. 
Then 



Bn{(3oj)-Bif3oj) 

n 

= n~^Y.^Wi-EWi) 

i=l 
n 

+ n-'Y.6d{G{Yn)-' - {G{Y*))-%a{Y*-7v{X,,fp,^)-p^{Y*)] 



A 



Then Ji < P(|Ii| > 64n-2^/4) + P(|/2| > bm''^^ /^)- Note that \Wi\ < 



C\iT{Xij)'^ (3qA, for some positive constant C. By the argument of Lemma 3.1, 



supj \fj{t) — 7r(Xjj)-^/3oj| < C2N~'^. Thus, \Wi\ are uniformly bounded by a 
constant M. Applying Bernstein's inequality, there exists a positive constant 
65 such that for all n sufficiently large, 

P{\h\ > b,n~^y4) < 2exp -— 1^— -^4— < 2exp(-65ni-4-). 



2M2 + M64n-2^/3 
Furthermore, applying Lemma 8.4, 

n n 

n 
+ n-'Y,^^Rn{Yn[Pa{Y* - 7r{Xijf P^j) - PaiY*)] = hi + ^22, 
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where ^ and i?„ are defined in Lemma 8.4. By Lemma 8.4, I22 = 
0(n~^'^(logn)^'^) almost surely. By assumptions n~^'^(logn)^'^ = o(n~^'^), 
and noting that 6,G-^{Y*)i{Y* ,5j,Y*)[Pa{Y* - 7T{Xijf {3^^) - p^{Y*)] are 
independent bounded random variables, we have for all n sufficiently large, 

P{\h\ > b^n-^^A) 

( n n 

^ > i=\ j=i,jjti 

X [p^iY* - 7v{X,,ff3o,) - Pe^iY*)] > 64^-278 1 



where bQ is a positive constant, by Lemma 8.6. Therefore, Ji < 4exp(— cyn^"^"^) 
where cj = min(65, ^g). 

Next, we evaluate J2. Let Vi{(3) = pa{Y* -7v{Xij)'^ (3) - pa{Y* -TviXy)^ (3oj) 
and let Zi = 6i[G{Y*)]"^Vi{(3). We have 



J2<P\ sup 

\\\l3-l3o,\\<Cm/^n-^ 

+ sup 

ll/3-/3o,ll<C'7VV2n- 



n 



i=l 
n 

i~'Y.^A(G{Ynr'-iG{Y*)r'MP) 



j=i 



Applying Knight's identity [(1998), page 758], we have 

Vm = 7v{X,,f{P - (3o,)[IiY* - 7v{X,,f(3o, < 0) - r] 

7r(X.,)^(/3-/3o^) 



Thus, 



(8.10) 



+ / [IiY*-7v{X,,f(3o,<s) 

-IiY*-niX,,fPo,<0)]ds. 



sup \Vi{f3)\<2 sup \7v{Xijf{P-(3Qj) 



ll/3-/3o,||<CAfi/2,i 



ll/3-/3o,||<CiVi/2n- 

< cNn-'' 
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for some c > because ||-Bfc(-)||oo < 1- Combining (8.10) with Lemma 8.4(3), 
we have 



sup 

(i'(iQj\\<Cm/^n- 



n 



-'Y.^AiG{Y:)r' - {G{Y:)y^]vm 



4 = 1 



= 0{Nn-^-^l'^{\ogn)^'^) 
almost surely. Assume A^^n^^~^ log?i = o(l), then for all n sufficiently large, 



J2<P\ sup 

\\\li-l3oj\\<Cm/^n- 



n 



'^Y.[Z,-E(Z,)\ 



i=l 



> h^n-'^/A . 



We use Lemma 8.5 to evaluate the above inequality. First note that, (8.10) 
implies that sup||^_^ ■\\<CN'^/'^n-^ 1-^*1 — c*Nn~'^ for some positive constant 
c* . Next, let ei, . . . , en be a Rademacher sequence (i.e., i.i.d. sequence taking 
values of ±1 with probability 1/2) independent of Zi, . . . , Z„. We have 



E< 



sup 

/9-/3oj||<C7Vi/2n- 



n 



< 2E< sup n 

lll/3-/3oj|l<C^''''""" 

< CE< sup n~ 

lll/3-/3oj|l<C^''''"-" 



^{Zi-EiZ,)) 

n 

y^^ejZj 



i=l 

n 



^ei7r{X,jf{P-fi, 



i=l 



< CN^/^n-^E 



n 



'^^e^^^{Xij] 



i=l 



< CN^I^n-^ 



E 



Oj, 



n 



'^y^^ei-KJXij) 



i=l 



1/2 



CAtV^ 



n 



n 



'E[Y,eh{X,,Yn{X,,) 



\i=l 



1/2 



< CiVl/2„-r-l/2 



for some generic constant C which may vary from line to line. In the above, 
the first inequality applies the symmetrization theorem [Lemma 2.3.1, van der 
Vaart and Wellner (1996)], the second inequality applies the contraction 
theorem [Ledoux and Talagrad, (1991)] using the Lipschitz property of the 
quantile objective function, and the last inequality uses (8.3). Now, we ap- 
ply Lemma 8.5 to evaluate J2. Let Z = sup||/3_^ .||<AAfi/2„-T n~^\ Ylll=i{^i ~ 
E(Zj))|. In Lemma 8.5, we take t = h^n-'^^ /2 - CiV^/^^-^^^/^ ^^^^ jj ^ 
Ac^N'^n"'^'^ , which gives 

J2 = P{Z >EZ+ {hin~'^''/A - EZ)) 

< P{Z >EZ+ (6471-2^/4 - CiV^/^i-^-^/2)) 
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/ 12(64^-274 -C7iVl/2^— 1/2)2 \ ,,_2 l-2rN 

<«^P(-^^ 8,2jv2^^2. ^J<exp(-C8iV V 2-) 

for some positive constant cg and all n sufficiently large. D 

8.3. Proof of the Theorem 4-1- In this subsection, we establish the ex- 
ponential tail probability bounds for P{\Sjk\ > C?i-^/8), /c = 1, . . . ,4, which 
lead to the result of Theorem 4.1. 

An exponential tail probability bound for Sji. Recall that 



^^m _fv N_/x. nTn— = 






OjJ 



It follows from Lemmas 8.1 and 8.7 that for some C* > 0, 
P{Sji > Cn'^S) 

< P{X^,^{Fn7T{Xj)7v{Xjf) > (62 + l)iV~') 

+ P(||3 • - Pojf > ib2 + ir'CNn-yS) 
(8.11) 

< 2N^exp{-CinN~^) + Pi\\p'j-f3oj\\ > C*N^/^n-^/^) 

< 2iV2exp(-C4niV-3) + P(||3j -/9oill > C*N^/^n~^) 

< 2iV2exp(-C4niV-3) + 4exp(-C7ni-^^) +exp(-C8iV-^i^-2^). 
An exponential tail probability bound for Sj2- We first establish an upper 

bound for ||/9o}ll- ^Y result (B.3) in the supplemental material [He, Wang 
and Hong (2013)], E[fj{Xj) - fnj{Xj)f < c^N^^'^, Vj, for some C3 > 0. It 
follows that 

HfnjiXjf] < 2E[fjiX,f] + 2E[if,{X,) - fnj{X,)f] 
< eg + 2c3iV-2^ 

for some positive constant cg. Also note that 

nfnjiXj)^] > X^UE7v{X,)7v{X,f)\\p,jf > 6iiV-i||/3o,f . 
This implies that ||/9oill — ciov^ for some positive constant ciq. 
Since \Sj2\ < 2||/3j - /3oj||Amax(Pn7r(Xj)7r(Xj)^)||/3oj||, we have 
P{\Sj2\>Cn~^/8) 

< P(||3,' - /3o,l|Amax(IP„7r(X,)7r(X,)^) > CN-'/\~y{lQcio)) 

< P{Xra.A^n^iXj)7T{Xjf) > (62 + l)iV"^) 

+ P(||3 • - Poj II > (^2 + ir'CN^^n-yiWcw)) 
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An exponential tail probability bound for Sj^. We have 
Sj3 = 3 •(F„7r(X,)7r(X,)^ - E7r(X,)7r(X,)^)/3o,- 

= (3; - /3o,)(IPn7r(X,)7r(X,)^ - En{Xj)7viX,f)P,^ 

= Sj3i +5j32. 



Therefore, 

P{\Sj3\ > Cn-yS) 

< P{\Sj3i\ > Cn-716) + P{\S,32\ > C?i"Vl6) 
<P(||3j'-/3ojl|max(|A„,ax(Dj)|,|A„,in(D,-)|)>CiV~'/'n-7(16cio)) 

+ P(||/3o,-f max(|A^ax(D,-)|,|A^i„(D,)|)>Cn-7l6) 
<P(max(|A„,ax(Dj)|,|A„,in(D,)|)>iV~V(16cio)) 

+ P(max(|A^ax(D,)|, |A„,in(D,)|) > CiV-in-7(16c?o)) 

< 2P(max(|A,,ax(D,-)|, |A„,in(Dj)|) > C*N~\~^) 
+ P{\\d'j-Poj\\>CNy'n-) 

< 2N^ exp(-C6iV"^n^"^^) + 4exp(-C7ni-^^) + exp(-C8iV"^n^"^^) 

for ah n sufficiently large, where the last inequality uses Lemmas 8.1 and 8.7. 
An exponential tail probability bound for Sj^. 

n 

Sj4 = -2Fk^ „(a)n-i ^[7v{Xjff3oj " E7v{X,f(3oj] 



i=l 
n 



^FKMj^h~'Y.<^^^^^(^J - f^Oj) 



A 



= 5^41 + 5j42 . 

Note that F^-^^{a) is uniformly bounded for /3i < a < /32 almost surely. 

From the argument of Lemma 3.1, E{tv{Xj)'^ (3qj) is uniformly bounded 
in Xj. Applying Bernstein's inequality to Sjn, there exists a positive con- 
stant cg such that P(|S'j4i| > Cn~'^/16) < exp(— cgn-^"^'^) for all n sufficiently 
large. On the other hand, by the Cauchy-Schwarz inequality, for all n suffi- 
ciently large, 

P(|5j42| > Cn-716) 
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1/2 



j;[7r(X,)^(3;-/3, 



Oj) 



i=l 



>C*n- 



1I/2 



n 



< P{[{f3^ - Poj) {rrMX,)7v{Xjy )(/3^. - f3o,)r > c 

< P(||3; - f3o,\\XUl{Pn7v{X,)n{X,f) > C*n~-) 

< P{X,n^,{Fn7T{Xj)7v{Xjf) > (62 + 1)A^"') 

for all n sufficiently large, where the last inequality uses Lemmas 8.1 and 8.7 
and C* denotes a generic positive constant which may vary from line to line. 
Therefore, for all n sufficiently large, 

P{\Sj4\ > Cn^yS) 

< 2N^ exp(-C6iV"^n^-2^) + 4exp(-C7ni-^^) + exp(-C8iV"^n^-2^) 
+ exp(— cgn ~ '^) 

< 2N'^ ex.p{-C6N~^n^-^^) + 5eiip{-C7n^-^^) + exp{-csN-'^n^-^^). 

Proof of Theorem 4.1. 

(1) We have 

' ||2 _ II f .||2| > r'n~'^\ 
ijWn WJnjW \^^n j 

<p(47V2exp(-C4iV-3n) + 17exp(-C7n^-^^) + 4exp(-C8iV"^ni-2^) 

+ 4iV2 exp(-C6iV~^n^~2^)) 
<p(17exp(-53n^"^^) + UN"^ e^p{-5iN-'^n^-'^^)) 

for all n sufficiently large, for some positive constants ^3 and 5^. 

(2) The result follows by making use of the bound in (1) and observing 
that 



P( max 



P{M^C M'^)>p(min\\r^^\\l>y, 



>Pfmin||/„,| 



■max|||/;j,||Ml/„,f|> 



l-P(niax|||7^,||2-||/„^.f|> min ||/„,-| 



> 1 — P ( max I 



jeAfa 



inj 



i;-ii/niri>cin-7i6 . 



D 
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SUPPLEMENTARY MATERIAL 

Supplement A: "Quantile-adaptive model-free variable screening for high- 
dimensional heterogeneous data" (DOl: 10.1214/13-AOS1087SUPP; .pdf). 
We provide additional technical details and numerical examples in the sup- 
plemental material. 
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